\(\;\)
\(\;\)
To explore the nature of the predictive accuracy of various polynomials, we will use a subset of the o3median attribute from the Chicago air pollution data in the R package gamair. \(O_3\), also known as ozone or trioxygen, is a strong oxidant with a chlorine-like odour that can cause respiratory damage if inhaled. Hence, \(O_3\) is an air pollutant when it is present at ground level, where people breathe it.
library("gamair")
data(chicago)
n <- 500
x <- 1:n
o3.data <- data.frame(x=1:n, y=chicago$o3median[1:n])
head(chicago) # first few lines of the data set
## death pm10median pm25median o3median so2median time tmpd
## 1 130 -7.4335443 NA -19.59234 1.9280426 -2556.5 31.5
## 2 150 NA NA -19.03861 -0.9855631 -2555.5 33.0
## 3 101 -0.8265306 NA -20.21734 -1.8914161 -2554.5 33.0
## 4 135 5.5664557 NA -19.67567 6.1393413 -2553.5 29.0
## 5 126 NA NA -19.21734 2.2784649 -2552.5 32.0
## 6 130 6.5664557 NA -17.63400 9.8585839 -2551.5 40.0
Using par(mfrow=c(1,2)), plot all the fitted polynomials with degree 2 and 10 on two different figures. Overlay the two fitted polynomials of degree 2 and 10 based on the whole population (make the colour of the population curves different from the others to make them stand out).
Using the var_mutilde function, calculate the sampling variability of the fitted polynomials with degree equal to 2 and 10.
Using the bias2_mutilde function, calculate the squared bias of the polynomials with degree equal to 2 and 10.
Using the apse_all function, calculate the APSE for complexities equal to 0:10.
\(\;\)
\(\;\)
\(\;\)
Bias-variance trade-offs: To explore the nature of the predictive accuracy of various polynomials, we construct a number of functions that allow us to generate data from the same response model and use these to assess the bias-variance trade-off of our estimators.
In this question, you are going to do the same, but now using the same mechanism used to generate the fake data used in the notes for illustrating different fits.
The fake data was constructed as follows:
# Get some x's
set.seed(341)
N <- 125
x <- runif(N, 0, 1)
x = rep(x, each=2)
mu <- function(x) { sin( 3*pi*x )/x }
# generate some ys
y <- mu(x) + rnorm(N, 0, .6)
# Here's the data
fake.data = data.frame(x=x, y=y)
Plot the population and use the getmuFun function to overlay a function that averages the \(y\) values when the corresponding \(x\) values are the same.
Use the getmuhat function.
Using par(mfrow=c(1,2)), plot all the fitted polynomials with degree 2 and 10 on two different figures. Overlay the two fitted polynomials of degree 2 and 10 based on the whole population.
Using var_mutilde, calculate the sampling variability of the fitted polynomials with degree equal to 2 and 10.
Using bias2_mutilde, calculate the squared bias of the polynomials with degree equal to 2 and 10.
Using apse_all, calculate the APSE for complexities equal to 0:10.
\(\;\)
n=100
set.seed(341)
fake.sample = sample(nrow(fake.data), n)
sample.data = fake.data[fake.sample,]
Using the getmuFun function, overlay a function that averages the \(y\) values when the corresponding \(x\) values are the same.
Write a function sample.kfold that creates the \(k\)-fold samples from a given population, with arguments k: the number of folds, pop: a population, xvarname: the name of the x variable, and yvarname: the name of the y variable. It should return the lists Ssamples and Tsamples; rep_len might be helpful.
Use the apse function to find an estimate of the APSE using \(k=5\) fold cross-validation when the complexity=2.
\(\;\)
\(\;\)
\(\;\)
Bias-variance trade-offs: To explore the nature of the predictive accuracy of various polynomials, we construct a number of functions that allow us to generate data from the same response model and use these to assess the bias-variance trade-off of our estimators. In this question, you are going to do the same for the following model: \[ Y_i = 20\,\log\left(\frac{x_i^2+5\sin (\pi x_i)}{x_i-5}\right) + E_i~~,\quad E_i\sim_{iid} G(0,1) \]
set.seed(341)
N <- 125
# The x values
x <- runif(N, 7,8.5)
x = rep(x, each=2)
mu <- function(x) { 20*log((x^2+5*sin(pi*x))/(x-5)) }
# generate y values
y <- mu(x) + rnorm(N, 0, .2)
# And below is the simulated data
simulated.data = data.frame(x=x, y=y)
Generate the scatter plot of the data and overlay fitted polynomials with degrees 2 and 10 to the data in different colours. Use the getmuhat function. Looking at this plot, which polynomial better fits the data?
Generate \(m=100\) samples each of size \(n=40\). Fit polynomials of degree 2 and 10 to every sample. Using par(mfrow=c(1,2)) plot all 100 fitted polynomials with degree 2 and all of those with degree 10 on the two different panels. Overlay the two fitted polynomial of degree 2 and 10 based on the whole population in a different colour so that they are visible.
Using var_mutilde, calculate the sampling variability of the fitted polynomials with degree equal to 2 and 10. Compare the two numbers.
Using bias2_mutilde calculate the squared bias of the polynomials with degree equal to 2 and 10.
Using the apse_all function, calculate the APSE for complexities equal to 0:10.
\(\;\)
n=100
set.seed(341)
simulated.sample = sample(nrow(simulated.data), n)
sample.data = simulated.data[simulated.sample,]
Using the getmuFun function, overlay a function that averages the \(y\) values when the corresponding \(x\) values are the same.
Write a function sample.kfold that creates the \(k\)-fold samples from a given population, i.e., with arguments
k: the number of folds, pop: a population, xvarname: the name of the \(x\) variable, and yvarname: the name of the \(y\) variable. It should return the lists Ssamples and Tsamples; rep_len might be helpful. Once you have the function, try
TEST = sample.kfold(k=5, pop=sample.data, "x", "y")
par(mfrow=c(1,2))
plot(TEST$Ssamples[[2]])
plot(TEST$Tsamples[[2]])
Use the function from part b) and the apse function to find an estimate of the APSE using \(k=5\) fold cross-validation when complexity=3.
Perform \(k=5\)-fold cross-validation to estimate the complexity parameter from the set \(0:10\). Plot APSE by the complexity.
Perform \(B=100\) bootstrap replications by resampling the errors to quantify the uncertainty in the complexity parameter. Construct a histogram of the complexity parameters to visualize this uncertainty, and comment on the plot.
\(\;\)
\(\;\)
\(\;\)
Suppose that the Animals Data is a sample and we are interested in a linear relationship between log(Brain) and log(Body).
library(MASS)
data(Animals2)
plot(log(Animals2$body), log(Animals2$brain),
pch=19, col=adjustcolor("grey", .5),
xlab="log(body)", ylab="log(Brain)")
Fit a linear function to the given sample using robust regression and the huber function.
Generate \(B=1000\) bootstrap samples by sampling the pairs, and fit a linear function to each sample using robust regression and the huber function.
Repeat part b) by resampling the errors to generate the bootstrap samples.
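Parts a) and b) can be sketched as follows. Assumptions: the "huber function" is taken to mean the Huber \(\psi\) used by MASS::rlm via psi.huber, the data frame dat below is a synthetic stand-in for the (log(body), log(brain)) pairs, and B is reduced from 1000 to keep the sketch quick.

```r
library(MASS)  # rlm: robust linear regression; psi.huber: Huber influence function

set.seed(341)
## Synthetic stand-in for the (log(body), log(brain)) pairs
dat <- data.frame(x = log(1:50), y = 2 + 0.7 * log(1:50) + rnorm(50, 0, 0.1))

## a) robust straight-line fit with the Huber psi
fit <- rlm(y ~ x, data = dat, psi = psi.huber)

## b) pairs bootstrap: resample rows with replacement, refit, keep coefficients
B <- 200  # the question asks for B = 1000
boot.coefs <- t(replicate(B, {
  idx <- sample(nrow(dat), nrow(dat), replace = TRUE)
  coef(rlm(y ~ x, data = dat[idx, ], psi = psi.huber, maxit = 50))
}))
apply(boot.coefs, 2, sd)  # bootstrap standard errors of intercept and slope
```

For part c), keep the fitted values from fit, resample its residuals with replacement, and refit on fitted(fit) plus the resampled residuals instead of resampling the rows.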
\(\;\)
\(\;\)
\(\;\)
To explore the nature of the predictive accuracy of various polynomials, we will use some weekly average air temperatures in Cairo. Load the file cairo_temp.csv from LEARN.
cairo.temp = read.csv("cairo_temp.csv")
head(cairo.temp)
## week temp
## 1 1 59.2
## 2 2 56.5
## 3 3 55.7
## 4 4 56.1
## 5 5 58.4
## 6 6 58.5
names(cairo.temp) = c('x', 'y')
Generate the scatter plot of the data and overlay fitted polynomials with degrees 2 and 10 to the data.
Generate \(m=50\) samples of size \(n=50\). Fit polynomials of degree 2 and 10 to every sample.
Using par(mfrow=c(1,2)) plot all the fitted polynomials with degree 2 and 10 on two different figures. Overlay the two fitted polynomials of degree 2 and 10 based on the whole population (make the colour of the population curves different from the others to make them stand out).
Using the var_mutilde function, calculate the sampling variability of the fitted polynomials with degree equal to 2 and 10.
Using bias2_mutilde function, calculate the squared bias of the polynomials with degree equal to 2 and 10.
Using the apse_all function, calculate the APSE for complexities equal to 0:10.
Instead of randomly constructing sample and test sets we can use \(k\)-fold cross-validation.
Write a function sample.kfold with arguments k: the number of folds, pop: a population, xvarname: the name of the x variable, and yvarname: the name of the y variable. It should return the lists Ssamples and Tsamples; rep_len might be helpful.
Use the function from part i) and the apse function to find an estimate of the APSE using \(k=5\) fold cross-validation when the complexity=2.
Perform \(k=10\)-fold cross-validation to estimate the complexity parameter from the set \(0:15\). Plot APSE by the complexity and give a conclusion.
\(\;\)
\(\;\)
\(\;\)
To explore the nature of the predictive accuracy of various polynomials, we will use a subset of the o3median attribute from the Chicago air pollution data in the R package gamair. \(O_3\), also known as ozone or trioxygen, is a strong oxidant with a chlorine-like odour that can cause respiratory damage if inhaled. Hence, \(O_3\) is an air pollutant when it is present at ground level, where people breathe it.
library("gamair")
data(chicago)
n <- 500
x <- 1:n
o3.data <- data.frame(x=1:n, y=chicago$o3median[1:n])
head(chicago) # first few lines of the data set
## death pm10median pm25median o3median so2median time tmpd
## 1 130 -7.4335443 NA -19.59234 1.9280426 -2556.5 31.5
## 2 150 NA NA -19.03861 -0.9855631 -2555.5 33.0
## 3 101 -0.8265306 NA -20.21734 -1.8914161 -2554.5 33.0
## 4 135 5.5664557 NA -19.67567 6.1393413 -2553.5 29.0
## 5 126 NA NA -19.21734 2.2784649 -2552.5 32.0
## 6 130 6.5664557 NA -17.63400 9.8585839 -2551.5 40.0
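The solutions below call getmuhat, which comes from the course notes and is not reproduced in this document. A minimal sketch of the assumed behaviour (a degree-`complexity` least-squares polynomial fit that returns the fitted function of x):

```r
## Sketch of getmuhat (assumption: least-squares polynomial fit of y on x
## that returns a prediction function).
getmuhat <- function(sampleXY, complexity = 1) {
  fit <- if (complexity == 0) {
    lm(y ~ 1, data = sampleXY)                        # degree 0: just the mean
  } else {
    lm(y ~ poly(x, complexity, raw = TRUE), data = sampleXY)
  }
  ## return a function that predicts at new x values
  function(x) predict(fit, newdata = data.frame(x = x))
}
```

The version in the notes also guards against extrapolation and rank-deficient fits; this sketch omits those details.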
muhat2 <- getmuhat(o3.data, 2)
muhat10 <- getmuhat(o3.data, 10)
xlim <- extendrange(o3.data$x)
plot(o3.data,
pch=19, col= adjustcolor("black", 0.5))
curve(muhat2, from = xlim[1], to = xlim[2],
add = TRUE, col="red", lwd=2)
curve(muhat10, from = xlim[1], to = xlim[2],
add = TRUE, col="steelblue", lwd=2)
title(main="red=degree 2 , blue=degree 10")
getSampleComp <- function(pop, size, replace=FALSE) {
N <- dim(pop)[1]
samp <- rep(FALSE, N)
samp[sample(1:N, size, replace = replace)] <- TRUE
samp
}
### This function will return a data frame containing
### only two variates, an x and a y
getXYSample <- function(xvarname, yvarname, samp, pop) {
sampData <- pop[samp, c(xvarname, yvarname)]
names(sampData) <- c("x", "y")
sampData
}
N_S <- 100
set.seed(341) # for reproducibility
n= 20
samps <- lapply(1:N_S, FUN= function(i){getSampleComp(o3.data, n)})
Ssamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", Si, o3.data)})
Tsamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", !Si, o3.data)})
muhats2 <- lapply(Ssamples, getmuhat, complexity = 2)
#getmubar(muhats2)
muhats10 <- lapply(Ssamples, getmuhat, complexity = 10)
#mubar10 <- getmubar(muhats10)
Using par(mfrow=c(1,2)), plot all the fitted polynomials with degree 2 and 10 on two different figures. Overlay the two fitted polynomials of degree 2 and 10 based on the whole population (make the colour of the population curves different from the others to make them stand out).
par(mfrow=c(1,2))
xvals <- seq(xlim[1], xlim[2], length.out = 200)
plot(o3.data,
pch=19, type='n',
xlab="x", ylab="predictions",
main= " muhats (degree = 2) & mubar")
for (i in 1:N_S) {
curveFn <- muhats2[[i]]
curve(curveFn, from = xlim[1], to = xlim[2], add=TRUE, col=adjustcolor("blue", 0.2), lwd=3, lty=(1))
}
curve(muhat2, from = xlim[1], to = xlim[2],
add=TRUE, col="firebrick", lwd=3)
points(o3.data,
pch=19, col= adjustcolor("black", 0.5))
plot(o3.data,
pch=19, type='n',
xlab="x", ylab="predictions",
main= " muhats (degree = 10) & mubar")
for (i in 1:N_S) {
curveFn <- muhats10[[i]]
curve(curveFn, xlim[1], xlim[2], add=TRUE, col=adjustcolor("blue", 0.2), lwd=3, lty=1)
}
curve(muhat10, xlim[1], xlim[2], add=TRUE, col="firebrick", lwd=3)
points(o3.data,
pch=19, col= adjustcolor("black", 0.5))
Using the var_mutilde function, calculate the sampling variability of the fitted polynomials with degree equal to 2 and 10.
getmubar <- function(muhats) {
function(x) {
Ans <- sapply(muhats, FUN=function(muhat){muhat(x)})
apply(Ans, MARGIN=1, FUN=mean)
}
}
ave_y_mu_sq <- function(sample, predfun, na.rm = TRUE){
mean((sample$y - predfun(sample$x))^2, na.rm = na.rm)
}
###########
ave_mu_mu_sq <- function(predfun1, predfun2, x, na.rm = TRUE){
mean((predfun1(x) - predfun2(x))^2, na.rm = na.rm)
}
###########
var_mutilde <- function(Ssamples, Tsamples, complexity){
## get the predictor function for every sample S
muhats <- lapply(Ssamples,
FUN=function(sample){
getmuhat(sample, complexity)
}
)
## get the average of these, mubar
mubar <- getmubar(muhats)
## average over all samples S
N_S <- length(Ssamples)
mean(sapply(1:N_S,
FUN=function(j){
## get muhat based on sample S_j
muhat <- muhats[[j]]
## average over (x_i,y_i) in a
## single sample T_j the squares
## (y - muhat(x))^2
T_j <- Tsamples[[j]]
ave_mu_mu_sq(muhat, mubar, T_j$x)
}
)
)
}
var_mutilde(Ssamples, Tsamples, complexity=2)
## [1] 39.4854
var_mutilde(Ssamples, Tsamples, complexity=10)
## [1] 5313019
Using the bias2_mutilde function, calculate the squared bias of the polynomials with degree equal to 2 and 10.
bias2_mutilde <- function(Ssamples, Tsamples, mu, complexity){
## get the predictor function for every sample S
muhats <- lapply(Ssamples,
FUN=function(sample) getmuhat(sample, complexity)
)
## get the average of these, mubar
mubar <- getmubar(muhats)
## average over all samples S
N_S <- length(Ssamples)
mean(sapply(1:N_S,
FUN=function(j){
## average over (x_i,y_i) in a
## single sample T_j the squares
## (y - muhat(x))^2
T_j <- Tsamples[[j]]
ave_mu_mu_sq(mubar, mu, T_j$x)
}
)
)
}
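getmuFun is likewise taken from the course notes. A plausible sketch, assuming it averages the \(y\) values at tied \(x\) values and interpolates linearly in between (constant beyond the observed range):

```r
## Sketch of getmuFun (assumption: conditional-mean interpolator).
getmuFun <- function(pop, xvarname, yvarname) {
  pop <- na.omit(pop[, c(xvarname, yvarname)])
  ## ties = mean averages the y values whenever x values coincide;
  ## rule = 2 returns the boundary value outside the observed x range
  approxfun(pop[, xvarname], pop[, yvarname], rule = 2, ties = mean)
}
```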
muhat = getmuFun(o3.data, "x", 'y')
bias2_mutilde(Ssamples, Tsamples, muhat, complexity=2)
## [1] 105.7658
bias2_mutilde(Ssamples, Tsamples, muhat, complexity=10)
## [1] 14093.58
Using the apse_all function, calculate the APSE for complexities equal to 0:10.
complexities <- 0:10
apse_vals <- sapply(complexities,
FUN = function(complexity){
apse_all(Ssamples, Tsamples,
complexity = complexity, mu = muhat)
}
)
# Print out the results
t(rbind(complexities, apse=round(apse_vals,5)))
## complexities apse var_mutilde bias2 var_y
## [1,] 0 1.419895e+02 5.504920e+00 109.74258 41.35165
## [2,] 1 1.497814e+02 1.423684e+01 108.20061 41.35165
## [3,] 2 1.735070e+02 3.948540e+01 105.76581 41.35165
## [4,] 3 9.734785e+01 2.377330e+01 46.07064 41.35165
## [5,] 4 1.521383e+02 7.382526e+01 50.00529 41.35165
## [6,] 5 3.701323e+02 2.983637e+02 43.44568 41.35165
## [7,] 6 8.089953e+03 7.855074e+03 192.62305 41.35165
## [8,] 7 1.921738e+04 1.896188e+04 211.56790 41.35165
## [9,] 8 2.310388e+05 2.280033e+05 2754.64124 41.35165
## [10,] 9 7.649464e+04 7.421728e+04 2059.55748 41.35165
## [11,] 10 5.328469e+06 5.313019e+06 14093.57714 41.35165
plot( complexities, apse_vals[1,], xlab="Degree", ylab="", type='l', ylim=c(0, 6e+06), col="purple", lwd=2 )
lines(complexities, apse_vals[2,], col="blue", lwd=2 )
lines(complexities, apse_vals[3,], col="red", lwd=2)
lines(complexities, apse_vals[4,], col="black", lwd=2)
# The increase in apse is too sharp in higher complexities. Let's zoom in a bit
zoom = 0:5
plot( complexities[zoom], apse_vals[1, zoom], xlab="Degree", ylab="", type='l', ylim=c(0, 400), col="purple", lwd=2 )
lines(complexities[zoom], apse_vals[2, zoom], col="blue", lwd=2 )
lines(complexities[zoom], apse_vals[3, zoom], col="red", lwd=2)
lines(complexities[zoom], apse_vals[4, zoom], col="black", lwd=2)
muhat3 <- getmuhat(o3.data, 3)
xlim <- extendrange(o3.data$x)
plot(o3.data,
pch=19, col= adjustcolor("black", 0.5),
main="Best fit (degree 3)")
curve(muhat3, from = xlim[1], to = xlim[2],
add = TRUE, col="red", lwd=2)
\(\;\)
\(\;\)
\(\;\)
Bias-variance trade-offs: To explore the nature of the predictive accuracy of various polynomials, we construct a number of functions that allow us to generate data from the same response model and use these to assess the bias-variance trade-off of our estimators.
In this question, you are going to do the same, but now using the same mechanism used to generate the fake data used in the notes for illustrating different fits.
The fake data was constructed as follows:
# Get some x's
set.seed(341)
N <- 125
x <- runif(N, 0, 1)
x = rep(x, each=2)
mu <- function(x) { sin( 3*pi*x )/x }
# generate some ys
y <- mu(x) + rnorm(N, 0, .6)
# Here's the data
fake.data = data.frame(x=x, y=y)
temp = getmuFun(fake.data, "x", 'y')
Plot the population and use the getmuFun function to overlay a function that averages the \(y\) values when the corresponding \(x\) values are the same.
Plot the population and overlay the true mean function and the function in i).
plot(x, y,
col=adjustcolor("firebrick", 0.4),
pch=19, cex=0.5,
main = "The fake data with mu(x) = sin( 3*pi*x )/x", xlab="x", ylab="response" )
muhat = getmuFun(fake.data, "x", 'y')
temp = getmuFun(fake.data, "x", 'y')
# overlay the mean function
xrange <- extendrange(x)
newx <- seq(min(xrange), max(xrange), length.out = 500)
lines(newx, temp(newx), col=1, type='l')
lines(newx, mu(newx), type='l', col=4)
Use the getmuhat function.
### The function getSampleComp will return a logical vector
### (that way the complement is also recorded in the sample)
getSampleComp <- function(pop, size, replace=FALSE) {
N <- nrow(pop)
samp <- rep(FALSE, N)
samp[sample(1:N, size, replace = replace)] <- TRUE
samp
}
### This function will return a data frame containing
### only two variates, an x and a y
getXYSample <- function(xvarname, yvarname, samp, pop) {
sampData <- pop[samp, c(xvarname, yvarname)]
names(sampData) <- c("x", "y")
sampData
}
xvarname <- "x"
yvarname <- "y"
pop <- data.frame(y=y,x=x)
### Construct the sample
samp <- getSampleComp(pop, 50)
sampData <- getXYSample(xvarname, yvarname, samp, pop)
### Set the degree of the polynomial
complexity <- 2
muhat2 <- getmuhat(sampData, 2)
muhat10 <- getmuhat(sampData, 10)
xlim <- extendrange(pop[, xvarname])
plot(sampData,
main="muhats (degree = 2 and 10) and the sample",
xlab = xvarname, ylab = yvarname,
pch=19, col= adjustcolor("black", 0.5))
curve(muhat2, from = xlim[1], to = xlim[2],
add = TRUE, col="steelblue", lwd=2)
curve(muhat10, from = xlim[1], to = xlim[2],
add = TRUE, col="red", lwd=2)
N_S <- 100
set.seed(341) # for reproducibility
n= 40
samps <- lapply(1:N_S, FUN= function(i){getSampleComp(pop, n)})
Ssamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", Si, pop)})
Tsamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", !Si, pop)})
muhats2 <- lapply(Ssamples, getmuhat, complexity = 2)
#getmubar(muhats2)
muhats10 <- lapply(Ssamples, getmuhat, complexity = 10)
#mubar10 <- getmubar(muhats10)
Using par(mfrow=c(1,2)), plot all the fitted polynomials with degree 2 and 10 on two different figures. Overlay the two fitted polynomials of degree 2 and 10 based on the whole population.
par(mfrow=c(1,2))
xvals <- seq(xlim[1], xlim[2], length.out = 200)
plot(pop[,c(xvarname, yvarname)],
pch=19, col= adjustcolor("black", 0.5),
xlab="x", ylab="predictions",
main= paste0(N_S, " muhats (degree = ", complexity, ") and mubar")
)
for (i in 1:N_S) {
curveFn <- muhats2[[i]]
curve(curveFn, from = xlim[1], to = xlim[2], add=TRUE, col="blue", lwd=3, lty=(i+1))
}
curve(muhat2, from = xlim[1], to = xlim[2],
add=TRUE, col="firebrick", lwd=3)
plot(pop[,c(xvarname, yvarname)],
pch=19, col= adjustcolor("black", 0.5),
xlab="x", ylab="predictions",
main= paste0(N_S, " muhats (degree = 10) and mubar")
)
for (i in 1:N_S) {
curveFn <- muhats10[[i]]
curve(curveFn, xlim[1], xlim[2], add=TRUE, col="blue", lwd=3, lty=(i+1))
}
curve(muhat10, xlim[1], xlim[2], add=TRUE, col="firebrick", lwd=3)
Using var_mutilde, calculate the sampling variability of the fitted polynomials with degree equal to 2 and 10.
var_mutilde(Ssamples, Tsamples, complexity=2)
## [1] 0.1288159
var_mutilde(Ssamples, Tsamples, complexity=10)
## [1] 3.694942
Using bias2_mutilde, calculate the squared bias of the polynomials with degree equal to 2 and 10.
bias2_mutilde(Ssamples, Tsamples, muhat, complexity=2)
## [1] 1.357051
bias2_mutilde(Ssamples, Tsamples, muhat, complexity=10)
## [1] 0.182035
Using apse_all, calculate the APSE for complexities equal to 0:10.
complexities <- 0:10
apse_vals <- sapply(complexities,
FUN = function(complexity){
apse_all(Ssamples, Tsamples,
complexity = complexity, mu = muhat)
}
)
# Print out the results
t(rbind(complexities, apse=round(apse_vals,5)))
## complexities apse var_mutilde bias2 var_y
## [1,] 0 15.00455 0.28249 14.12932 0.51809
## [2,] 1 7.60287 0.29261 6.80066 0.51809
## [3,] 2 1.81655 0.12882 1.35705 0.51809
## [4,] 3 1.44675 0.14690 0.99523 0.51809
## [5,] 4 0.58793 0.08385 0.23916 0.51809
## [6,] 5 0.43442 0.05318 0.14097 0.51809
## [7,] 6 0.47345 0.08851 0.14385 0.51809
## [8,] 7 0.54486 0.15736 0.13687 0.51809
## [9,] 8 0.91761 0.51795 0.14446 0.51809
## [10,] 9 0.82807 0.44424 0.12669 0.51809
## [11,] 10 4.14580 3.69494 0.18204 0.51809
plot( complexities, apse_vals[3,], xlab="Degree", ylab="", type='l', ylim=c(0, 2), col="firebrick", lwd=2 )
lines(complexities, apse_vals[2,], xlab="Degree", ylab="", col="steelblue", lwd=2 )
lines(complexities, apse_vals[1,], col="purple", lwd=2)
\(\;\)
\(\;\)
\(\;\)
n=100
set.seed(341)
fake.sample = sample(nrow(fake.data), n)
sample.data = fake.data[fake.sample,]
Using the getmuFun function, overlay a function that averages the \(y\) values when the corresponding \(x\) values are the same.
plot(sample.data$x, sample.data$y,
col=adjustcolor("firebrick", 0.4),
pch=19, cex=0.5,
main = "The fake data with mu(x) = sin( 3*pi*x )/x", xlab="x", ylab="response" )
sample.muFun = getmuFun(sample.data, "x", 'y')
curve(sample.muFun, from = 0, to = 1, add=TRUE)
Write a function sample.kfold with arguments k: the number of folds, pop: a population, xvarname: the name of the x variable, and yvarname: the name of the y variable. It should return the lists Ssamples and Tsamples; rep_len might be helpful.
sample.kfold <- function(k=NULL, pop=NULL, xvarname=NULL, yvarname=NULL ) {
N = nrow(pop)
kset = rep_len(1:k, N)
kset = sample(kset)
samps = list()
for (i in 1:k) {
samps[[i]] = logical(N)
samps[[i]][kset != i] = TRUE
}
Ssamples <- lapply(samps,
FUN= function(Si){getXYSample(xvarname, yvarname, Si, pop)})
Tsamples <- lapply(samps,
FUN= function(Si){getXYSample(xvarname, yvarname, !Si, pop)})
list(Ssamples=Ssamples,Tsamples=Tsamples)
}
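The rep_len call in sample.kfold above cycles the fold labels 1:k to length N before shuffling, so the folds are as balanced as possible (their sizes differ by at most one); a quick illustration:

```r
set.seed(341)
k <- 5; N <- 12
## cycle 1..k to length N, then shuffle so fold membership is random
kset <- sample(rep_len(1:k, N))
table(kset)  # fold sizes differ by at most one
```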
Use the apse function to find an estimate of the APSE using \(k=5\) fold cross-validation when the complexity=2.
kfold.samples = sample.kfold(k=5, pop=sample.data, "x", "y")
apse_all(kfold.samples$Ssamples, kfold.samples$Tsamples, complexity = 2, mu = sample.muFun)
## apse var_mutilde bias2 var_y
## 1.56981603 0.01024957 1.23462564 0.69410576
complexities <- 0:10
apse_vals <-sapply(complexities,
FUN = function(complexity){
apse_all(kfold.samples$Ssamples, kfold.samples$Tsamples,
complexity = complexity, mu = sample.muFun) })
# Print out the results
t(rbind(complexities, apse=round(apse_vals,5)))
## complexities apse var_mutilde bias2 var_y
## [1,] 0 17.60165 0.08744 16.26245 0.69411
## [2,] 1 7.47448 0.07542 6.64126 0.69411
## [3,] 2 1.56982 0.01025 1.23463 0.69411
## [4,] 3 1.38883 0.01035 1.08309 0.69411
## [5,] 4 0.49961 0.00665 0.25935 0.69411
## [6,] 5 0.44107 0.00786 0.21582 0.69411
## [7,] 6 0.43991 0.00855 0.20605 0.69411
## [8,] 7 0.44438 0.01418 0.20408 0.69411
## [9,] 8 0.47047 0.02907 0.20577 0.69411
## [10,] 9 0.43243 0.01430 0.20181 0.69411
## [11,] 10 0.44841 0.02037 0.19582 0.69411
complexities[ apse_vals[2,] == min(apse_vals[2,]) ]
## [1] 4
plot( complexities, apse_vals[3,], xlab="Degree", ylab="", type='l', ylim=c(0, 2), col="firebrick", lwd=2 )
lines(complexities, apse_vals[2,], xlab="Degree", ylab="", col="steelblue", lwd=2 )
lines(complexities, apse_vals[1,], col="purple", lwd=2)
muhats5 <- getmuhat( sample.data, complexity = 5)
y.hat = muhats5(sample.data$x)
R = sample.data$y - y.hat
B = 100
n = nrow(sample.data)
complexities = 0:10
z = numeric(B)
boot.data = sample.data
set.seed(341)
for (i in 1:B) {
#boot.data = sample.data[sample(n, n, replace=TRUE),]
boot.data$y = y.hat + R[sample(n, n, replace=TRUE)]
boot.muFun = getmuFun(boot.data, "x", 'y')
kfold.boot = sample.kfold(k=5, pop=boot.data, "x", "y")
apse.boot <- sapply(complexities,
FUN = function(complexity){
apse_all(kfold.boot$Ssamples, kfold.boot$Tsamples,
complexity = complexity, mu = boot.muFun) })
z[i] = complexities[ apse.boot[2,] == min(apse.boot[2,]) ]
}
hist(z, breaks=0:12 -1/2, xlab="Model Complexity", main="Histogram of Model Complexity", prob=TRUE)
\(\;\)
\(\;\)
\(\;\)
\(\;\)
Bias-variance trade-offs: To explore the nature of the predictive accuracy of various polynomials, we construct a number of functions that allow us to generate data from the same response model and use these to assess the bias-variance trade-off of our estimators. In this question, you are going to do the same for the following model: \[ Y_i = 20\,\log\left(\frac{x_i^2+5\sin (\pi x_i)}{x_i-5}\right) + E_i~~,\quad E_i\sim_{iid} G(0,1) \]
set.seed(341)
N <- 125
# The x values
x <- runif(N, 7,8.5)
x = rep(x, each=2)
mu <- function(x) { 20*log((x^2+5*sin(pi*x))/(x-5)) }
# generate y values
y <- mu(x) + rnorm(N, 0, .2)
# And below is the simulated data
simulated.data = data.frame(x=x, y=y)
Generate the scatter plot of the data and overlay fitted polynomials with degrees 2 and 10 using the getmuhat function. Looking at this plot, which polynomial better fits the data?
muhat2 <- getmuhat(simulated.data, 2)
muhat10 <- getmuhat(simulated.data, 10)
xlim <- extendrange(simulated.data[,1])
plot(simulated.data,
pch=19, col= adjustcolor("black", 0.5))
curve(muhat2, from = xlim[1], to = xlim[2],
add = TRUE, col="red", lwd=2)
curve(muhat10, from = xlim[1], to = xlim[2],
add = TRUE, col="steelblue", lwd=2)
title(main="red=degree 2 , blue=degree 10")
Clearly, the degree 10 polynomial fits better, but the question is: would a polynomial of degree smaller than 10 (but larger than 2) fit just as well?
Using par(mfrow=c(1,2)), plot all 100 fitted polynomials with degree 2 and all of those with degree 10 on two different panels. Overlay the two fitted polynomials of degree 2 and 10 based on the whole population in a different colour so that they are visible.
getSampleComp <- function(pop, size, replace=FALSE) {
N <- dim(pop)[1]
samp <- rep(FALSE, N)
samp[sample(1:N, size, replace = replace)] <- TRUE
samp
}
### This function will return a data frame containing
### only two variates, an x and a y
getXYSample <- function(xvarname, yvarname, samp, pop) {
sampData <- pop[samp, c(xvarname, yvarname)]
names(sampData) <- c("x", "y")
sampData
}
N_S <- 100
set.seed(341) # for reproducibility
n= 40
samps <- lapply(1:N_S, FUN= function(i){getSampleComp(simulated.data, n)})
Ssamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", Si, simulated.data)})
Tsamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", !Si, simulated.data)})
muhats2 <- lapply(Ssamples, getmuhat, complexity = 2)
#getmubar(muhats2)
muhats10 <- lapply(Ssamples, getmuhat, complexity = 10)
#mubar10 <- getmubar(muhats10)
par(mfrow=c(1,2))
xvals <- seq(xlim[1], xlim[2], length.out = 200)
plot(simulated.data,
pch=19, type='n',
xlab="x", ylab="predictions",
main= " muhats (degree = 2) & mubar")
for (i in 1:N_S) {
curveFn <- muhats2[[i]]
curve(curveFn, from = xlim[1], to = xlim[2], add=TRUE, col="blue", lwd=3, lty=(1))
}
curve(muhat2, from = xlim[1], to = xlim[2],
add=TRUE, col="firebrick", lwd=3)
points(simulated.data,
pch=19, col= adjustcolor("black", 0.5))
plot(simulated.data,
pch=19, type='n',
xlab="x", ylab="predictions",
main= " muhats (degree = 10) & mubar")
for (i in 1:N_S) {
curveFn <- muhats10[[i]]
curve(curveFn, xlim[1], xlim[2], add=TRUE, col="blue", lwd=3, lty=1)
}
curve(muhat10, xlim[1], xlim[2], add=TRUE, col="firebrick", lwd=3)
points(simulated.data,
pch=19, col= adjustcolor("black", 0.5))
Using var_mutilde, calculate the sampling variability of the fitted polynomials with degree equal to 2 and 10. Compare the two numbers.
getmubar <- function(muhats) {
function(x) {
Ans <- sapply(muhats, FUN=function(muhat){muhat(x)})
apply(Ans, MARGIN=1, FUN=mean)
}
}
ave_y_mu_sq <- function(sample, predfun, na.rm = TRUE){
mean((sample$y - predfun(sample$x))^2, na.rm = na.rm)
}
###########
ave_mu_mu_sq <- function(predfun1, predfun2, x, na.rm = TRUE){
mean((predfun1(x) - predfun2(x))^2, na.rm = na.rm)
}
###########
var_mutilde <- function(Ssamples, Tsamples, complexity){
## get the predictor function for every sample S
muhats <- lapply(Ssamples,
FUN=function(sample){
getmuhat(sample, complexity)
}
)
## get the average of these, mubar
mubar <- getmubar(muhats)
## average over all samples S
N_S <- length(Ssamples)
mean(sapply(1:N_S,
FUN=function(j){
## get muhat based on sample S_j
muhat <- muhats[[j]]
## average over (x_i,y_i) in a
## single sample T_j the squares
## (y - muhat(x))^2
T_j <- Tsamples[[j]]
ave_mu_mu_sq(muhat, mubar, T_j$x)
}
)
)
}
var_mutilde(Ssamples, Tsamples, complexity=2)
## [1] 0.01457752
var_mutilde(Ssamples, Tsamples, complexity=10)
## [1] 0.0124782
There is no clear difference in variability between the two groups, though the degree 10 polynomials have slightly lower variability.
Using bias2_mutilde, calculate the squared bias of the polynomials with degree equal to 2 and 10.
bias2_mutilde <- function(Ssamples, Tsamples, mu, complexity){
## get the predictor function for every sample S
muhats <- lapply(Ssamples,
FUN=function(sample) getmuhat(sample, complexity)
)
## get the average of these, mubar
mubar <- getmubar(muhats)
## average over all samples S
N_S <- length(Ssamples)
mean(sapply(1:N_S,
FUN=function(j){
## average over (x_i,y_i) in a
## single sample T_j the squares
## (y - muhat(x))^2
T_j <- Tsamples[[j]]
ave_mu_mu_sq(mubar, mu, T_j$x)
}
)
)
}
muhat = getmuFun(simulated.data, 'x', 'y')
bias2_mutilde(Ssamples, Tsamples, muhat,complexity=2)
## [1] 0.1644139
bias2_mutilde(Ssamples, Tsamples, muhat, complexity=10)
## [1] 0.01871003
There is a clear difference (almost an order of magnitude) between the squared biases. Of course, the more complex model (degree 10) has a lower bias.
Using the apse_all function, calculate the APSE for complexities equal to 0:10.
apse_all <- function(Ssamples, Tsamples, complexity, mu){
## average over the samples S
##
N_S <- length(Ssamples)
muhats <- lapply(Ssamples,
FUN=function(sample) getmuhat(sample, complexity)
)
## get the average of these, mubar
mubar <- getmubar(muhats)
rowMeans(sapply(1:N_S,
FUN=function(j){
T_j <- Tsamples[[j]]
muhat <- muhats[[j]]
## Take care of any NAs
T_j <- na.omit(T_j)
y <- T_j$y
x <- T_j$x
mu_x <- mu(x)
muhat_x <- muhat(x)
mubar_x <- mubar(x)
## apse:
## average over (x_i,y_i) in a
## single sample T_j the squares
## (y - muhat(x))^2
apse <- (y - muhat_x)
## bias2:
## average over x_i in a single
## sample T_j the squares
## (mubar(x) - mu(x))^2
bias2 <- (mubar_x - mu_x)
## var_mutilde:
## average over x_i in a single
## sample T_j the squares
## (muhat(x) - mubar(x))^2
var_mutilde <- (muhat_x - mubar_x)
## var_y:
## average over (x_i,y_i) in a
## single sample T_j the squares
## (y - mu(x))^2
var_y <- (y - mu_x)
## Put them together and square them
squares <- rbind(apse, var_mutilde, bias2, var_y)^2
## return means
rowMeans(squares)
}
))
}
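The four rows returned by apse_all are tied together by the usual decomposition; the cross-terms average out over samples, so the sum of the last three rows approximately equals the first:
\[
\underbrace{E\big[(y-\hat\mu(x))^2\big]}_{\texttt{apse}} \;\approx\; \underbrace{E\big[(\hat\mu(x)-\bar\mu(x))^2\big]}_{\texttt{var\_mutilde}} \;+\; \underbrace{E\big[(\bar\mu(x)-\mu(x))^2\big]}_{\texttt{bias2}} \;+\; \underbrace{E\big[(y-\mu(x))^2\big]}_{\texttt{var\_y}}
\]
This is why, in the plots of the four curves against complexity, the apse curve tracks the sum of its three components.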
complexities <- 0:10
apse_vals <- sapply(complexities,
FUN = function(complexity){
apse_all(Ssamples, Tsamples,
complexity = complexity, mu = muhat)
}
)
# Print out the results
t(rbind(complexities, apse=round(apse_vals,5)))
## complexities apse var_mutilde bias2 var_y
## [1,] 0 0.97581 0.01950 0.92848 0.01972
## [2,] 1 0.91632 0.03872 0.84349 0.01972
## [3,] 2 0.20308 0.01458 0.16441 0.01972
## [4,] 3 0.04853 0.00488 0.02295 0.01972
## [5,] 4 0.04826 0.00573 0.02141 0.01972
## [6,] 5 0.04806 0.00636 0.02020 0.01972
## [7,] 6 0.04976 0.00772 0.02018 0.01972
## [8,] 7 0.05052 0.00889 0.01945 0.01972
## [9,] 8 0.05151 0.00990 0.01912 0.01972
## [10,] 9 0.05289 0.01123 0.01892 0.01972
## [11,] 10 0.05429 0.01248 0.01871 0.01972
# Plotting the results
plot( complexities, apse_vals[1,], xlab="Degree", ylab="", type='l',
ylim=c(0, 1), col="purple", lwd=2 )
lines(complexities, apse_vals[2,], col="blue", lwd=2 )
lines(complexities, apse_vals[3,], col="red", lwd=2)
lines(complexities, apse_vals[4,], col="black", lwd=2)
indx = which.min(t(apse_vals)[,'apse'])
complexities[indx]
## [1] 5
The polynomial with degree 5 has the lowest APSE.
\(\;\)
n=100
set.seed(341)
simulated.sample = sample(nrow(simulated.data), n)
sample.data = simulated.data[simulated.sample,]
Using the getmuFun function, overlay a function that averages the \(y\) values when the corresponding \(x\) values are the same.
plot(sample.data$x, sample.data$y,
col=adjustcolor("firebrick", 0.4),
pch=19, cex=0.5,
main = "The Sample Data", xlab="x", ylab="reponse" )
sample.muFun = getmuFun(sample.data, "x", 'y')
curve(sample.muFun, from = -6, to =15, add=TRUE)
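getmuFun is defined earlier in the notes; a minimal sketch of what such a function could look like, assuming it averages the \(y\) values at each distinct \(x\) and interpolates linearly in between (the name getmuFun.sketch is hypothetical), is:

```r
getmuFun.sketch <- function(data, xvarname, yvarname) {
  ## average the y values at each distinct x
  ybar <- tapply(data[[yvarname]], data[[xvarname]], mean)
  ## interpolate linearly between the averages;
  ## rule = 2 clamps predictions outside the observed x-range
  approxfun(as.numeric(names(ybar)), as.numeric(ybar), rule = 2)
}
```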
Write a function sample.kfold that creates the \(k\)-fold samples from a given population. Its arguments are k, the number of folds; pop, a population; xvarname, the name of the \(x\) variable; and yvarname, the name of the \(y\) variable. It returns a list of the two named elements Ssamples and Tsamples. The function rep_len might be helpful. Once you have the function, try
TEST = sample.kfold(k=5, pop=sample.data, "x", "y")
par(mfrow=c(1,2))
plot(TEST$Ssamples[[2]])
plot(TEST$Tsamples[[2]])
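The rep_len hint can be examined in isolation: it recycles the fold labels \(1,\dots,k\) to length \(N\), so the fold sizes differ by at most one, and sample() then shuffles the labels.

```r
k <- 3; N <- 7
kset <- rep_len(1:k, N)  # 1 2 3 1 2 3 1
table(kset)              # fold sizes differ by at most one
set.seed(341)            # any seed; for reproducibility
kset <- sample(kset)     # the same labels in random order
```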
sample.kfold <- function(k=NULL, pop=NULL, xvarname=NULL, yvarname=NULL ) {
N = nrow(pop)
kset = rep_len(1:k, N)
kset = sample(kset)
samps = list()
for (i in 1:k) {
samps[[i]] = logical(N)
samps[[i]][kset != i] = TRUE
}
Ssamples <- lapply(samps,
FUN= function(Si){getXYSample(xvarname, yvarname, Si, pop)})
Tsamples <- lapply(samps,
FUN= function(Si){getXYSample(xvarname, yvarname, !Si, pop)})
list(Ssamples=Ssamples,Tsamples=Tsamples)
}
TEST = sample.kfold(k=5, pop=sample.data, "x", "y")
par(mfrow=c(1,2))
plot(TEST$Ssamples[[2]])
plot(TEST$Tsamples[[2]])
Use the apse function to find an estimate of the APSE using \(k=5\) fold cross-validation when complexity=3.
kfold.samples = sample.kfold(k=5, pop=sample.data, "x", "y")
apse_all(kfold.samples$Ssamples, kfold.samples$Tsamples, complexity = 3, mu = sample.muFun)
## apse var_mutilde bias2 var_y
## 0.0482549336 0.0003269455 0.0346875589 0.0110752583
complexities <- 0:10
apse_vals <-sapply(complexities,
FUN = function(complexity){
apse_all(kfold.samples$Ssamples, kfold.samples$Tsamples,
complexity = complexity, mu = sample.muFun) })
# Print out the results
apse.Analysis = t(rbind(complexities, apse=round(apse_vals,5)))
plot( complexities, apse_vals[3,], xlab="Degree", ylab="", type='l',
ylim=c(0, 1.5), col="firebrick", lwd=2 )
lines(complexities, apse_vals[2,], col="steelblue", lwd=2 )
lines(complexities, apse_vals[1,], col="purple", lwd=2)
We pick a degree 5 polynomial based on cross-validation. Notice that the difference between complexities in this neighbourhood is negligible. One could also increase the polynomial degree beyond 10 in this plot to better visualize the increase in APSE when the complexity is too large.
muhats4 <- getmuhat(sample.data, complexity = 4)
y.hat = muhats4(sample.data$x)
R = sample.data$y - y.hat
B = 100
n = nrow(sample.data)
complexities = 0:10
z = numeric(B)
boot.data = sample.data
set.seed(341)
for (i in 1:B) {
#boot.data = sample.data[sample(n, n, replace=TRUE),]
boot.data$y = y.hat + R[sample(n, n, replace=TRUE)]
boot.muFun = getmuFun(boot.data, "x", 'y')
kfold.boot = sample.kfold(k=5, pop=boot.data, "x", "y")
apse.boot <- sapply(complexities,
FUN = function(complexity){
apse_all(kfold.boot$Ssamples, kfold.boot$Tsamples,
complexity = complexity, mu = boot.muFun) })
z[i] = complexities[ which.min(apse.boot[2,]) ]
}
hist(z, breaks=0:12 -1/2, xlab="Model Complexity", main="Histogram of Model Complexity", prob=TRUE)
We see that a polynomial of degree 3 is what this analysis really suggests, and the variability around this value is quite small. Reading the APSE plot, there does not seem to be a meaningful difference between the APSE of a degree 3 polynomial and that of a degree 5 polynomial.
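The key step in the loop above is the residual bootstrap: the fitted values stay fixed while the residuals are resampled with replacement. A minimal self-contained sketch of that step, on synthetic data with a plain lm() fit standing in for muhats4:

```r
set.seed(341)
n <- 50
x <- runif(n, -2, 2)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
fit   <- lm(y ~ x)
y.hat <- fitted(fit)
R     <- resid(fit)
## one residual-bootstrap replicate of the response:
## fitted values plus resampled residuals
y.star <- y.hat + R[sample(n, n, replace = TRUE)]
```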
\(\;\)
\(\;\)
\(\;\)
\(\;\)
Suppose that the Animals Data is a sample and we are interested in a linear relationship between log(Brain) and log(Body).
library(MASS)
data(Animals2)
plot(log(Animals2$body), log(Animals2$brain),
pch=19, col=adjustcolor("grey", .5),
xlab="log(body)", ylab="log(Brain)")
library(robustbase)
## Warning: package 'robustbase' was built under R version 3.5.2
library(MASS)
## Warning: package 'MASS' was built under R version 3.5.2
data(Animals2)
#Animals2 = Animals2[-c(63,64,65),]
x = log(Animals2$body)
y = log(Animals2$brain)
n =length(y)
plot(x,y, pch=19, col=adjustcolor("grey", .5),
xlab="log(body)", ylab="log(Brain)")
beta.hat = rlm(y ~ x, psi="psi.huber")$coef
abline(coef=beta.hat, col=adjustcolor("firebrick", 1))
\(\;\)
Obtaining the bootstrap samples.
B = 1000
m = 1000
set.seed(341)
beta.boot = t(sapply(1:B, FUN =function(b)
rlm(y~ x, subset=sample(n,n, replace=TRUE), psi="psi.huber", maxit = 200)$coef ))
Plotting the results.
xseq = seq(-6, 12, length.out=100)
plot(x, y, pch=19, col=adjustcolor("firebrick", 0.3),
xlab="log(body)", ylab="log(Brain)")
invisible(apply(beta.boot, 1, function(coef, x) {
lines(x, coef[1] + x*coef[2], col= adjustcolor("grey", 0.1) )
}, x=xseq))
points(x, y, pch=19, col=adjustcolor("firebrick", 0.3))
\(\;\)
a <- 0.01
boot.fit1 = apply(beta.boot, 1, function(coef, x) {
coef[1] + x*coef[2]
}, x=-1)
quantile(boot.fit1, probs=c(a/2, 1-a/2))
## 0.5% 99.5%
## 1.175522 1.785296
a <- 0.01
boot.fit2 = apply(beta.boot, 1, function(coef, x) {
coef[1] + x*coef[2]
}, x=10)
quantile(boot.fit2, probs=c(a/2, 1-a/2))
## 0.5% 99.5%
## 6.576325 9.846972
# computing quantile
xseq = seq(-6, 12, length.out = 50)
boot.fit = apply(beta.boot, 1, function(coef, x) {
coef[1] + x*coef[2]
}, x=xseq)
plot(x, y, pch=19, col=adjustcolor("firebrick", 0.6),
xlab="log(body)", ylab="log(Brain)" )
ci = apply(boot.fit, 1, quantile, c(a/2, 1-a/2))
lines(xseq, ci[1,], lty=2, lwd=2, col=1 )
lines(xseq, ci[2,], lty=2, lwd=2, col=1 )
lines(xseq, beta.hat[1] + xseq*beta.hat[2], col=1)
points(x, y, pch=19, col=adjustcolor("firebrick", 0.6) )
\(\;\)
\(\;\)
\(\;\)
\(\;\)
Obtaining the bootstrap samples.
m = 1000
set.seed(341)
y.hat = beta.hat[1] + beta.hat[2]*x
R = y - y.hat
nonpar.boot.sam = Map(function(b)
{ Rstar = R[sample(n,n,replace=TRUE)]
y = beta.hat[1] + beta.hat[2]*x + Rstar
data.frame( x = x, y= y ) } , 1:B)
beta.boot = t(sapply(nonpar.boot.sam, function(sam) {
rlm(y~ x, data=sam, psi="psi.huber", maxit = 100 )$coef } ))
Plotting the results.
plot(x, y, pch=19, col=adjustcolor("firebrick", 0.3),
xlab="log(body)", ylab="log(Brain)" )
invisible(apply(beta.boot, 1, function(coef, x) {
lines(x, coef[1] + x*coef[2], col= adjustcolor("grey", 0.1) )
}, x=xseq))
points(x, y, pch=19, col=adjustcolor("firebrick", 0.3))
\(\;\)
a <- 0.01
boot.fit1 = apply(beta.boot, 1, function(coef, x) {
coef[1] + x*coef[2]
}, x=-1)
quantile(boot.fit1, probs=c(a/2, 1-a/2))
## 0.5% 99.5%
## 1.086172 1.712245
a <- 0.01
boot.fit2 = apply(beta.boot, 1, function(coef, x) {
coef[1] + x*coef[2]
}, x=10)
quantile(boot.fit2, probs=c(a/2, 1-a/2))
## 0.5% 99.5%
## 8.449129 9.749525
# computing quantile
xseq = seq(-6, 12, length.out = 50)
boot.fit = apply(beta.boot, 1, function(coef, x) {
coef[1] + x*coef[2]
}, x=xseq)
plot(x, y, pch=19, col=adjustcolor("firebrick", 0.6) )
ci = apply(boot.fit, 1, quantile, c(a/2, 1-a/2))
lines(xseq, ci[1,], lty=2, lwd=2, col=1 )
lines(xseq, ci[2,], lty=2, lwd=2, col=1 )
lines(xseq, beta.hat[1] + xseq*beta.hat[2], col=1)
\(\;\)
\(\;\)
\(\;\)
\(\;\)
To explore the nature of the predictive accuracy of various polynomials, we will use some weekly average air temperatures in Cairo. Load the file cairo_temp.csv
cairo.temp = read.csv("cairo_temp.csv")
head(cairo.temp)
## week temp
## 1 1 59.2
## 2 2 56.5
## 3 3 55.7
## 4 4 56.1
## 5 5 58.4
## 6 6 58.5
names(cairo.temp) = c('x', 'y')
library(splines) # provides bs() used below
getmuhat <- function(sampleXY, complexity = 1) {
formula <- paste0("y ~ ",
if (complexity==0) "1"
else {
if (complexity < 3 ) {
paste0("poly(x, ", complexity, ", raw = FALSE)")
} else {
## for complexity >= 3 use a B-spline basis (bs) to avoid
## the numerical problems of high-degree polynomials
paste0("bs(x, ", complexity, ")")
}
}
)
fit <- lm(as.formula(formula), data = sampleXY)
tx = sampleXY$x
ty = fit$fitted.values
range.X = range(tx)
val.rY = c( mean(ty[tx == range.X[1]]),
mean(ty[tx == range.X[2]]) )
## From this we construct the predictor function
muhat <- function(x){
if ("x" %in% names(x)) {
## x is a dataframe containing the variate named
## by xvarname
newdata <- x
} else
## x is a vector of values that needs to be a data.frame
{ newdata <- data.frame(x = x) }
## The prediction
##
suppressWarnings({
ypred = predict(fit, newdata = newdata) })
ypred[newdata$x < range.X[1]] = val.rY[1]
ypred[newdata$x > range.X[2]] = val.rY[2]
ypred
}
## muhat is the function that we need to calculate values
## at any x, so we return this function from getmuhat
muhat
}
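The formula string assembled inside getmuhat can be checked directly; for example, with a hypothetical degree of 5 the construction yields a B-spline formula:

```r
complexity <- 5
formula <- paste0("y ~ ",
                  if (complexity == 0) "1"
                  else if (complexity < 3) paste0("poly(x, ", complexity, ", raw = FALSE)")
                  else paste0("bs(x, ", complexity, ")"))
formula  # "y ~ bs(x, 5)"
```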
muhat2 <- getmuhat(cairo.temp, 2)
muhat10 <- getmuhat(cairo.temp, 10)
xlim <- extendrange(cairo.temp[,1])
plot(cairo.temp,
pch=19, col= adjustcolor("black", 0.5))
curve(muhat2, from = xlim[1], to = xlim[2],
add = TRUE, col="red", lwd=2)
curve(muhat10, from = xlim[1], to = xlim[2],
add = TRUE, col="steelblue", lwd=2)
title(main="red=degree 2 , blue=degree 10")
getSampleComp <- function(pop, size, replace=FALSE) {
N <- dim(pop)[1]
samp <- rep(FALSE, N)
samp[sample(1:N, size, replace = replace)] <- TRUE
samp
}
### This function will return a data frame containing
### only two variates, an x and a y
getXYSample <- function(xvarname, yvarname, samp, pop) {
sampData <- pop[samp, c(xvarname, yvarname)]
names(sampData) <- c("x", "y")
sampData
}
N_S <- 50
set.seed(341) # for reproducibility
n= 50
samps <- lapply(1:N_S, FUN= function(i){getSampleComp(cairo.temp, n)})
Ssamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", Si, cairo.temp)})
Tsamples <- lapply(samps, FUN= function(Si){getXYSample("x", "y", !Si, cairo.temp)})
muhats2 <- lapply(Ssamples, getmuhat, complexity = 2)
#getmubar(muhats2)
muhats10 <- lapply(Ssamples, getmuhat, complexity = 10)
#mubar10 <- getmubar(muhats10)
Using par(mfrow=c(1,2)), plot all the fitted polynomials of degree 2 and 10 on two different figures. Overlay the two fitted polynomials of degree 2 and 10 based on the whole population (make the colour of the population curves different from the others so that they stand out).
par(mfrow=c(1,2))
xvals <- seq(xlim[1], xlim[2], length.out = 200)
plot(cairo.temp,
pch=19, type='n',
xlab="x", ylab="predictions",
main= " muhats (degree = 2) & mubar")
for (i in 1:N_S) {
curveFn <- muhats2[[i]]
curve(curveFn, from = xlim[1], to = xlim[2], add=TRUE, col=adjustcolor("blue", 0.2), lwd=3, lty=(1))
}
curve(muhat2, from = xlim[1], to = xlim[2],
add=TRUE, col="firebrick", lwd=3)
points(cairo.temp,
pch=19, col= adjustcolor("black", 0.5))
plot(cairo.temp,
pch=19, type='n',
xlab="x", ylab="predictions",
main= " muhats (degree = 10) & mubar")
for (i in 1:N_S) {
curveFn <- muhats10[[i]]
curve(curveFn, xlim[1], xlim[2], add=TRUE, col=adjustcolor("blue", 0.2), lwd=3, lty=1)
}
curve(muhat10, xlim[1], xlim[2], add=TRUE, col="firebrick", lwd=3)
points(cairo.temp,
pch=19, col= adjustcolor("black", 0.5))
Using the var_mutilde function, calculate the sampling variability of the polynomials with degree equal to 2 and 10.
getmubar <- function(muhats) {
function(x) {
Ans <- sapply(muhats, FUN=function(muhat){muhat(x)})
apply(Ans, MARGIN=1, FUN=mean)
}
}
ave_y_mu_sq <- function(sample, predfun, na.rm = TRUE){
mean((sample$y - predfun(sample$x))^2, na.rm = na.rm)
}
###########
ave_mu_mu_sq <- function(predfun1, predfun2, x, na.rm = TRUE){
mean((predfun1(x) - predfun2(x))^2, na.rm = na.rm)
}
###########
var_mutilde <- function(Ssamples, Tsamples, complexity){
## get the predictor function for every sample S
muhats <- lapply(Ssamples,
FUN=function(sample){
getmuhat(sample, complexity)
}
)
## get the average of these, mubar
mubar <- getmubar(muhats)
## average over all samples S
N_S <- length(Ssamples)
mean(sapply(1:N_S,
FUN=function(j){
## get muhat based on sample S_j
muhat <- muhats[[j]]
## average over (x_i,y_i) in a
## single sample T_j the squares
## (y - muhat(x))^2
T_j <- Tsamples[[j]]
ave_mu_mu_sq(muhat, mubar, T_j$x)
}
)
)
}
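In formulas, with \(\overline{\mu}(x) = \frac{1}{N_S}\sum_{j=1}^{N_S}\hat{\mu}_{S_j}(x)\) computed by getmubar, var_mutilde returns
\[
\widehat{\mathrm{Var}}(\tilde{\mu}) \;=\; \frac{1}{N_S}\sum_{j=1}^{N_S} \frac{1}{|T_j|}\sum_{i \in T_j} \left(\hat{\mu}_{S_j}(x_i) - \overline{\mu}(x_i)\right)^2 .
\]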
var_mutilde(Ssamples, Tsamples, complexity=2)
## [1] 2.973982
var_mutilde(Ssamples, Tsamples, complexity=10)
## [1] 2.953363
Using the bias2_mutilde function, calculate the squared bias of the polynomials with degree equal to 2 and 10.
bias2_mutilde <- function(Ssamples, Tsamples, mu, complexity){
## get the predictor function for every sample S
muhats <- lapply(Ssamples,
FUN=function(sample) getmuhat(sample, complexity)
)
## get the average of these, mubar
mubar <- getmubar(muhats)
## average over all samples S
N_S <- length(Ssamples)
mean(sapply(1:N_S,
FUN=function(j){
## average over (x_i,y_i) in a
## single sample T_j the squares
## (y - muhat(x))^2
T_j <- Tsamples[[j]]
ave_mu_mu_sq(mubar, mu, T_j$x)
}
)
)
}
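In formulas, bias2_mutilde returns
\[
\widehat{\mathrm{Bias}^2}(\tilde{\mu}) \;=\; \frac{1}{N_S}\sum_{j=1}^{N_S} \frac{1}{|T_j|}\sum_{i \in T_j} \left(\overline{\mu}(x_i) - \mu(x_i)\right)^2 ,
\]
where \(\mu\) is the target function (here the interpolating average produced by getmuFun) and \(\overline{\mu}\) is the average of the fitted polynomials computed by getmubar.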
muhat = getmuFun(cairo.temp, "x", 'y')
#cairo.temp$y - muhat(cairo.temp$x)
bias2_mutilde(Ssamples, Tsamples, muhat, complexity=2)
## [1] 96.19658
bias2_mutilde(Ssamples, Tsamples, muhat, complexity=10)
## [1] 13.86907
Using the apse_all function, calculate the APSE for complexities equal to 0:10 (here we extend the range to 0:15 to show the eventual increase).
complexities <- 0:15
muhat = getmuFun(cairo.temp, "x", 'y')
apse_vals <- sapply(complexities,
FUN = function(complexity){
apse_all(Ssamples, Tsamples,
complexity = complexity, mu = muhat)
}
)
Print out the results
round( t(rbind(complexities, apse=round(apse_vals,5))),1)
## complexities apse var_mutilde bias2 var_y
## [1,] 0 112.7 1.3 109.5 0
## [2,] 1 115.4 2.5 109.2 0
## [3,] 2 103.2 3.0 96.2 0
## [4,] 3 105.5 4.0 96.1 0
## [5,] 4 56.5 4.6 49.0 0
## [6,] 5 65.1 7.5 52.9 0
## [7,] 6 24.1 5.6 16.4 0
## [8,] 7 22.4 4.0 16.2 0
## [9,] 8 19.9 3.2 14.5 0
## [10,] 9 19.5 3.1 14.0 0
## [11,] 10 19.3 3.0 13.9 0
## [12,] 11 19.5 3.2 13.6 0
## [13,] 12 20.1 3.6 13.5 0
## [14,] 13 20.5 4.0 13.2 0
## [15,] 14 20.7 4.3 13.0 0
## [16,] 15 21.3 4.7 12.9 0
par(mfrow=c(1,2))
plot( complexities, apse_vals[1,], xlab="Degree", ylab="", type='l', ylim=c(0, max(apse_vals) ), col="purple", lwd=2 )
lines(complexities, apse_vals[2,], col="blue", lwd=2 )
lines(complexities, apse_vals[3,], col="red", lwd=2)
lines(complexities, apse_vals[4,], col="black", lwd=2)
# The large APSE values at low complexity dominate the scale. Let's zoom in on the higher degrees
zoom = 7:15
plot( complexities[zoom], apse_vals[1, zoom], xlab="Degree", ylab="", type='l', ylim=c(0, max(apse_vals[,zoom]) ), col="purple", lwd=2 )
lines(complexities[zoom], apse_vals[2, zoom], col="blue", lwd=2 )
lines(complexities[zoom], apse_vals[3, zoom], col="red", lwd=2)
lines(complexities[zoom], apse_vals[4, zoom], col="black", lwd=2)
muhat10 <- getmuhat(cairo.temp, 10)
xlim <- extendrange(cairo.temp[,1])
plot(cairo.temp,
pch=19, col= adjustcolor("black", 0.5),
main="Best fit (degree 10)")
curve(muhat10, from = xlim[1], to = xlim[2],
add = TRUE, col="red", lwd=2)
\(\;\)
\(\;\)
\(\;\)
Instead of randomly constructing sample and test sets, we can use \(k\)-fold cross-validation.
Write a function sample.kfold taking the arguments k, the number of folds; pop, a population; xvarname, the name of the x variable; and yvarname, the name of the y variable. It returns a list of the two named elements Ssamples and Tsamples. The function rep_len might be helpful.
sample.kfold <- function(k=NULL, pop=NULL, xvarname=NULL, yvarname=NULL ) {
N = nrow(pop)
kset = rep_len(1:k, N)
kset = sample(kset)
samps = list()
for (i in 1:k) {
samps[[i]] = logical(N)
samps[[i]][kset != i] = TRUE
}
Ssamples <- lapply(samps,
FUN= function(Si){getXYSample(xvarname, yvarname, Si, pop)})
Tsamples <- lapply(samps,
FUN= function(Si){getXYSample(xvarname, yvarname, !Si, pop)})
list(Ssamples=Ssamples,Tsamples=Tsamples)
}
Use the apse function to find an estimate of the APSE using \(k\)-fold cross-validation (here \(k=10\)) when complexity=2.
cairo.temp.muFun = getmuFun(cairo.temp, "x", 'y')
kfold.samples = sample.kfold(k=10, pop=cairo.temp, "x", "y")
apse_all(kfold.samples$Ssamples, kfold.samples$Tsamples, complexity = 2, mu = cairo.temp.muFun)
## apse var_mutilde bias2 var_y
## 100.5682473 0.3267217 95.5191848 0.0000000
complexities <- 0:15
apse_vals <- sapply(complexities,
FUN = function(complexity){
apse_all(kfold.samples$Ssamples, kfold.samples$Tsamples,
complexity = complexity, mu = cairo.temp.muFun) })
# Print out the results
t(rbind(complexities, apse=round(apse_vals,5)))
## complexities apse var_mutilde bias2 var_y
## [1,] 0 110.26060 0.08139 108.69017 0
## [2,] 1 112.40437 0.21025 108.41738 0
## [3,] 2 100.56825 0.32672 95.51918 0
## [4,] 3 102.27981 0.43858 95.50643 0
## [5,] 4 52.59967 0.54551 49.52610 0
## [6,] 5 67.57154 1.05166 59.52932 0
## [7,] 6 16.96717 1.21988 15.15112 0
## [8,] 7 21.17531 0.80229 17.83607 0
## [9,] 8 16.96280 0.63725 14.62305 0
## [10,] 9 15.99556 0.32656 13.97877 0
## [11,] 10 15.96780 0.29304 13.78307 0
## [12,] 11 15.59129 0.32778 13.36052 0
## [13,] 12 15.54063 0.32948 13.21977 0
## [14,] 13 15.56432 0.32155 13.24166 0
## [15,] 14 16.11326 0.45189 13.14555 0
## [16,] 15 16.18850 0.74751 12.68697 0
par(mfrow=c(1,2))
plot( complexities, apse_vals[1,], xlab="Degree", ylab="", type='l', ylim=c(0, max(apse_vals) ), col="purple", lwd=2 )
lines(complexities, apse_vals[2,], col="blue", lwd=2 )
lines(complexities, apse_vals[3,], col="red", lwd=2)
lines(complexities, apse_vals[4,], col="black", lwd=2)
# The large APSE values at low complexity dominate the scale. Let's zoom in on the higher degrees
zoom = 7:15
plot( complexities[zoom], apse_vals[1, zoom], xlab="Degree", ylab="", type='l', ylim=c(0, max(apse_vals[,zoom]) ), col="purple", lwd=2 )
lines(complexities[zoom], apse_vals[2, zoom], col="blue", lwd=2 )
lines(complexities[zoom], apse_vals[3, zoom], col="red", lwd=2)
lines(complexities[zoom], apse_vals[4, zoom], col="black", lwd=2)